D60-D67 Nucleic Acids Research, 2014, Vol. 42, Database issue 
doi:10.1093/nar/gkt952 



Published online 24 October 2013 



uORFdb— a comprehensive literature database on 
eukaryotic uORF biology 

Klaus Wethmar 1,2,*, Adriano Barbosa-Silva 3, Miguel A. Andrade-Navarro 3 and
Achim Leutz 1,4

1 Max Delbrück Center for Molecular Medicine (MDC), Cell Differentiation and Tumorigenesis,
Robert-Rössle-Strasse 10, D-13092 Berlin, Germany, 2 Hematology, Oncology and Tumor Immunology,
Helios Klinikum Berlin-Buch, Schwanebecker Chaussee 50, D-13125 Berlin, Germany, 3 Max Delbrück
Center for Molecular Medicine (MDC), Computational Biology and Data Mining, Robert-Rössle-Strasse 10,
D-13092 Berlin, Germany and 4 Humboldt-University, Department of Biology, Invalidenstrasse 43,
D-10115 Berlin, Germany

Received August 22, 2013; Revised September 25, 2013; Accepted September 26, 2013 



ABSTRACT 

Approximately half of all human transcripts contain 
at least one upstream translational initiation site that 
precedes the main coding sequence (CDS) and gives 
rise to an upstream open reading frame (uORF). We 
generated uORFdb, publicly available at http://cbdm. 
mdc-berlin.de/tools/uorfdb, to serve as a compre- 
hensive literature database on eukaryotic uORF 
biology. Upstream ORFs affect downstream transla- 
tion by interfering with the unrestrained progression 
of ribosomes across the transcript leader sequence. 
Although the first uORF-related translational activity 
was observed >30 years ago, and an increasing 
number of studies link defective uORF-mediated 
translational control to the development of human 
diseases, the features that determine uORF- 
mediated regulation of downstream translation are 
not well understood. The uORFdb was manually 
curated from all uORF-related literature listed at the 
PubMed database. It categorizes individual publica- 
tions by a variety of denominators including taxon, 
gene and type of study. Furthermore, the database 
can be filtered for multiple structural and functional 
uORF-related properties to allow convenient and 
targeted access to the complex field of eukaryotic 
uORF biology. 

INTRODUCTION 

Ribosome profiling of the yeast, mouse and human tran- 
scriptomes uncovered high rates of translation beyond the 
borders of annotated main protein-coding sequences 
(CDSs) (1-4). Most of these non-protein-coding translational
hot spots are localized within the transcript leader
sequence of mRNAs (4), where upstream AUG codons or
alternative upstream initiation codons give rise to
upstream open reading frames (uORFs). The presence of 
uORFs, which may overlap or terminate upstream of the 
main protein CDS, affects downstream initiation effi- 
ciency and the translation rate of the respective protein 
(Figure 1). 

The regulatory potential of uORFs was first
described in the 1980s (5); however, only recently have
ribosome profiling and a growing list of physiological
and medical implications lent increased
biological significance to uORF-mediated translational
control (6-9). For example, germ line mutations resulting 
in the de novo generation or functional activation of 
uORFs in two prominent tumor suppressor genes 
(CDKN2A and CDKN1B) were associated with the de- 
velopment of hereditary melanoma and multiple endo- 
crine neoplasia syndrome (MEN4), respectively (9,10). 

The vast majority of experiments focused on the func- 
tional analysis of AUG-initiated uORFs by luciferase 
reporter assays and mostly demonstrated inhibitory 
effects on downstream translation. Exceptionally, uORFs 
can also mediate the paradoxical induction of downstream 
protein translation under unfavorable global translational 
conditions, as intensively studied for the yeast transcription 
factor GCN4 in response to nutrient stresses (11). A multitude
of other uORF-related regulatory functions (12,13),
e.g. uORF-directed selection of downstream translational
initiation sites in mammalian key-regulatory transcription
factors (14) or the uORF-mediated integration of small
molecule concentrations determining downstream translational
activity (15-18), may hold abundant novel therapeutic
target sites for medical application.

Figure 1. Model of a uORF-containing transcript. Two uORFs (blue
boxes) precede the main ORF of the CDS (white box). Ribosomes may
initiate at the CDS initiation codon (white flag) after leaky scanning
through both uORF initiation codons (blue flags), or may reinitiate
after translating the first uORF and leaky scanning through the
second uORF initiation site. Ribosomes translating the second
overlapping uORF will not be available for translation of the CDS.

Owing to the overwhelming number and transcript- 
specific variability of uORF-related properties and func- 
tions, the biology of uORFs is far from being understood. 
With uORFdb, we generated a comprehensive browsable 
literature database on eukaryotic uORF biology to 
provide a rapid, targeted and convenient overview of 
this developing field. 



LITERATURE REVIEW AND GENERATION OF THE 
DATABASE 

Since February 2010, we have applied a Boolean search for 'upstream
open reading' or 'uORF' or 'uORFs' or 'upstream
initiation' or 'uAUG' or 'small open reading' or 'sORF' or 
'upORF' or 'ribosome profiling' to the NCBI PubMed 
database at http://www.ncbi.nlm.nih.gov/pubmed. On 15 
July 2013, this search returned 981 publications. All ab- 
stracts were curated manually to eliminate non-related ac- 
cidental hits. Furthermore, only publications investigating 
eukaryotic or viral transcripts/uORFs were included, while 
bacterial data were omitted. 
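
The Boolean search described above can be sketched in a few lines of Python. This is only an illustration: the search phrases are taken from the text, but the exact query syntax the curators entered into PubMed is not stated, so the quoting and the `retmax` value below are assumptions. The sketch builds a URL for the NCBI E-utilities `esearch` endpoint and does not send any request.

```python
from urllib.parse import urlencode

# Search phrases quoted from the text above.
SEARCH_TERMS = [
    "upstream open reading",
    "uORF",
    "uORFs",
    "upstream initiation",
    "uAUG",
    "small open reading",
    "sORF",
    "upORF",
    "ribosome profiling",
]

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_boolean_query(terms):
    """Join the search phrases with OR, quoting each phrase."""
    return " OR ".join(f'"{term}"' for term in terms)


def build_esearch_url(terms, retmax=1000):
    """Return an esearch URL for the query (assumed parameters; no request is sent)."""
    params = {
        "db": "pubmed",
        "term": build_boolean_query(terms),
        "retmax": retmax,  # assumption: large enough to return all hits at once
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

The returned hits would still require the manual curation step described above, since the search phrases also match unrelated and bacterial studies.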

Most importantly, during the curation process, we 
identified a number of numerical, structural, sequential 
and cofactor-related properties that were recurrently 
associated with uORF-mediated regulatory functions. 
All references were screened and indexed for these newly 
defined function-related categories. Additionally, publica- 
tions were categorized by the type of article, by the taxon 
and by the gene investigated. Wherever required, full-text 
articles were analyzed to extract missing information ac- 
cording to the uORFdb denominators. All information 
was collected to build a publicly available browsable 
database at http://cbdm.mdc-berlin.de/tools/uorfdb/. 

The initial release of uORFdb provided links to 467
uORF-related references covering a wide range of
species/taxa and genes (Table 1). The comprehensive literature
survey performed to generate uORFdb revealed
that only ~100 of the >10 000 human protein-coding
genes that produce uORF-bearing transcripts have been
investigated for uORF-mediated translational control
mechanisms. The proportion of analyzed uORF genes
for other species is even lower, e.g. ~0.4% for mouse
and yeast, and ~0.1% for rat.

Table 1. Content of uORFdb v1.0

Taxon          References   Genes
Human          166          103
Yeast          85           15
Mouse          66           43
Virus          50           17
Arabidopsis    28           14
Rat            21           16
Others         52           47

The table summarizes the number of references per taxon and the
number of analyzed genes per taxon contained within the initial
release of uORFdb. Note that reviews and other manuscript categories
that are not restricted to specific transcripts or taxa are not
represented in this table.

Considering the universal prevalence of one or 
more uORF(s) in ~50% of mRNAs in mammalian 
transcriptomes, together with the recently proven high 
rate of uORF-mediated translational activity (4,7), the 
number of reports on functionally important uORFs is 
likely to rapidly increase within the next decade of research. 

FEATURES OF THE DATABASE 

The uORFdb is intended to facilitate convenient and 
targeted access to the complex field of uORF-mediated 
translational control mechanisms by a web-based query 
tool. Making use of manually curated data derived from 
a review of all PubMed-listed uORF-related literature, 
users may query uORFdb by three options: 

I) Query uORF bibliography by gene or taxon. 

A free-text input field at the query page of the web 
interface allows flexible search inputs, including gene 
name, gene symbol, gene alias, NCBI Gene/GenBank 
ID, taxon or taxon common name to identify uORF- 
related references for a specific gene or taxon. 

II) Query uORF bibliography by uORF-related 
properties. 

An individual user-specific literature compilation for 
one or multiple uORF-related properties can be generated 
by simple one-click selections of the respective categories 
on the query page. 

III) Query uORF bibliography by manuscript category. 

Users may limit returned references to specific manu- 
script categories, including protocols, review articles and 
studies characterized by the type of the experimental 
method applied. 

After querying uORFdb, an output page (Figure 2) 
returns a table summarizing all categories met or ad- 
dressed by the respective publications. Wherever 
possible, the output table provides the taxon, official 
gene symbol and accession number for individual 
uORF-bearing genes or transcripts, along with links to 
the corresponding records in the NCBI's Entrez Gene or 
Nucleotide databases for further sequence analysis (19). 
Selection fields next to each reference in the output table 
allow users to directly display an individual set of ab- 
stracts at the PubMed web page for further reading. 
Query results, as well as the complete content of 
uORFdb, may be downloaded from the output page and 
downloads page, respectively. 
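
The three query options described in this section can be sketched as a simple in-memory filter. The record layout, the field names and the rule that several selected properties must all be present are assumptions made for illustration; the text does not specify how uORFdb combines multiple selections.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """One curated literature record (hypothetical layout)."""
    pmid: int
    taxon: str
    gene: str
    category: str  # e.g. "review", "protocol", "reporter assay"
    properties: set = field(default_factory=set)


def query(records, gene_or_taxon=None, properties=(), category=None):
    """Return records matching every supplied criterion.

    Criteria that are left out match any record. Multiple properties are
    combined with AND, which is an assumption of this sketch.
    """
    hits = []
    for rec in records:
        if gene_or_taxon is not None:
            needle = gene_or_taxon.lower()
            if needle not in (rec.gene.lower(), rec.taxon.lower()):
                continue
        if not set(properties) <= rec.properties:
            continue
        if category is not None and rec.category != category:
            continue
        hits.append(rec)
    return hits
```

For example, `query(records, gene_or_taxon="human", properties=["tissue specificity"])` corresponds to the kind of combined filter shown in Figure 2.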

TECHNICAL SPECIFICATIONS 

The uORFdb is presented as a Web site developed using the
PHP programming language (version 5.3.2, www.php.net).
When users select the desired filters, a
server-side PHP script builds a corresponding SQL query
and executes it on the MySQL system where uORFdb
data are stored (MySQL Server version 5.1.61). Matching
records are fetched from the MySQL query result and
populated into an HTML table that is displayed in the
user's browser.

Figure 2. Example of a uORFdb query result. At the first release of uORFdb, filtering for 'human' data and selecting 'tissue specificity' from the
uORF-related property section returned 12 PubMed IDs, linked to the respective abstracts. The table summarizes all categories met or addressed
within the returned publications. The green check marks indicate 'positive' evidence for a regulatory function of the respective uORF-related
property, or that the respective manuscript category is met. The red X symbol identifies 'negative' evidence for a regulatory function of the respective
uORF-related property (e.g. PMID 8027046 contains evidence that the overlap of the uORF with the CDS does not influence CDS translation).
Users may sort the output table for a category of choice by clicking on the column header (a white arrowhead indicates active sorting). By checking
the selection fields on the left, users may select an individual set of abstracts for bulk display via the 'PubMed' button above the table. A yellow
funnel sign in the header of the table marks active filters. Links to NCBI Gene or GenBank entries allow further sequence analysis, and query results
may be downloaded for local use via a link below the table.
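
The filter-to-SQL-to-HTML flow described in this section can be sketched as follows. The site itself uses PHP and MySQL; the sketch uses Python with an in-memory SQLite database purely as a stand-in, and the table name `refs`, the column names and the demo rows are hypothetical. The point illustrated is that user-selected filters become placeholders in a parameterized query rather than being pasted into the SQL text.

```python
import sqlite3
from html import escape

# Hypothetical whitelist mapping filter names to column names.
FILTER_COLUMNS = {"taxon": "taxon", "gene": "gene_symbol", "category": "category"}


def build_query(filters):
    """Translate {filter_name: value} into a parameterized SELECT statement."""
    clauses, params = [], []
    for name, value in filters.items():
        if name not in FILTER_COLUMNS:
            raise ValueError(f"unknown filter: {name}")
        clauses.append(f"{FILTER_COLUMNS[name]} = ?")
        params.append(value)
    sql = "SELECT pmid, taxon, gene_symbol, category FROM refs"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY pmid", params


def rows_to_html(rows, headers):
    """Render fetched rows as a minimal HTML table."""
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def demo_connection():
    """Create an in-memory database with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE refs (pmid INTEGER, taxon TEXT, gene_symbol TEXT, category TEXT)"
    )
    conn.executemany(
        "INSERT INTO refs VALUES (?, ?, ?, ?)",
        [
            (1, "human", "GENE_A", "review"),
            (2, "mouse", "GENE_B", "reporter assay"),
            (3, "human", "GENE_C", "reporter assay"),
        ],
    )
    return conn
```

Calling `build_query({"taxon": "human"})`, executing the result on `demo_connection()` and passing the rows to `rows_to_html` reproduces the three steps named in the text: build the query, fetch matching records, populate an HTML table.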

The following section will provide short explanatory 
and summarizing paragraphs on the individual categories 
of uORFdb: 



DETERMINANTS OF uORF PRESENCE OR 
ABSENCE 

• Alternative promoters • Alternative splicing • Tissue- 
specific uORFs 

While AUG is the best conserved trinucleotide within the 
transcript leader sequence of human and mouse (7), the 
general prevalence of uAUGs is lower than expected by 
normal distribution (20). These observations argue for the 
functional importance of uAUGs and for an evolutionary
negative selection, respectively. In specific cases, the
presence or absence of one or several uORF(s) is depend- 
ent on the transcript variant produced by transcription 
initiation from alternative promoters or due to alternative 
splicing. For example, the predominant usage of an alter- 
native promoter within the oncogene MDM-2 in tumor 
cells results in the production of a transcript variant 
lacking exon 1 and two inhibitory uORFs, leading to
increased translation of MDM-2 protein (21). Tissue-spe- 
cific presence and functional importance of uORFs have 
been reported for a number of human and mouse genes 
including AdipoR1, where a gain of two translationally
repressive uORFs in a splicing-derived alternative transcript
in muscle tissue is implicated in whole-body insulin sensi- 
tivity and glucose tolerance (22). 

• Non-AUG uORFs 

In a recent study using global translational initiation 
sequencing (4), 54% of human transcripts displayed one 
or more translational initiation site(s) preceding the CDS. 
Surprisingly, about three-fourths of upstream translation 
was initiated by near-cognate, non-AUG initiation 
codons, further qualifying the classical 'first-AUG' rule.
Nevertheless, uAUG codons appeared to be functionally
most effective in repressing CDS translation. To date, only
two publications, analyzing human BIRC2 and yeast
GCN4, have focused on non-AUG uORF functions
at the individual transcript level (23,24).
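
The distinction between AUG and near-cognate initiation codons can be made concrete with a short scan of a transcript leader. The sketch below takes near-cognate codons to be the nine triplets that differ from AUG at exactly one position; the function name and the output format are illustrative. Finding such a triplet in the sequence says nothing about whether it is actually used for initiation, which is what the sequencing data cited above measure.

```python
from itertools import product

# Near-cognate codons: triplets differing from AUG at exactly one position.
NEAR_COGNATE = {
    "".join(codon)
    for codon in product("ACGU", repeat=3)
    if sum(a != b for a, b in zip(codon, "AUG")) == 1
}


def find_upstream_starts(leader):
    """Return (position, codon, kind) for each candidate start codon.

    `leader` is the transcript leader sequence (RNA or DNA alphabet);
    positions are 0-based offsets from the 5' end.
    """
    leader = leader.upper().replace("T", "U")
    hits = []
    for i in range(len(leader) - 2):
        codon = leader[i:i + 3]
        if codon == "AUG":
            hits.append((i, codon, "AUG"))
        elif codon in NEAR_COGNATE:
            hits.append((i, codon, "near-cognate"))
    return hits
```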



STRUCTURAL AND SEQUENCE-DEPENDENT 
uORF PROPERTIES 

• Number • Length • Distance from 5'-cap • Distance 
from uORF-stop to CDS • CDS overlap • RNA second- 
ary structure 

Many publications investigated the importance of struc- 
tural and sequence-dependent uORF properties in 
mediating translational regulation. The impact of uORF 
number, length and position within the transcript leader 
sequence has most intensively been studied in the classical 
model for uORF-mediated translational control, the yeast 
GCN4 transcript (11,25) and in a series of mutational ex- 
periments performed by M. Kozak, reviewed in (26). The 
repression of downstream translation appears to be posi- 
tively correlated with the number of uORFs per transcript, 
the length of the uORF and the distance between the 5'-cap 
structure and the uORF initiation codon. Furthermore, 
translational repression correlates negatively with the 
distance between the uORF-stop and the CDS initiation 
site and is even more profound when the uORF overlaps 
the CDS initiation codon. Together, the experiments 
suggest a dynamic regulatory model, where indispensable 
initiation cofactors detach gradually from ribosomes during 
the elongation phase of uORF translation, but may be 
reloaded to allow reinitiation at the CDS. 
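
The structural properties listed in this section (number, length, distance from the 5'-cap, distance from uORF-stop to CDS and CDS overlap) can all be read directly from a transcript sequence. The following sketch does this for AUG-initiated uORFs only; the data layout and function name are illustrative, and treating an in-frame uAUG without an intervening stop codon as an N-terminal extension rather than a uORF is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

STOP_CODONS = {"UAA", "UAG", "UGA"}


@dataclass
class UORF:
    cap_distance: int           # nucleotides between the 5' end and the uAUG
    length_codons: int          # sense codons, including AUG, excluding stop
    stop_to_cds: Optional[int]  # nucleotides from uORF stop to CDS start; None if overlapping
    overlaps_cds: bool


def annotate_uorfs(mrna, cds_start):
    """Describe AUG-initiated uORFs that start upstream of cds_start (0-based)."""
    mrna = mrna.upper().replace("T", "U")
    uorfs = []
    for start in range(cds_start):
        if mrna[start:start + 3] != "AUG":
            continue
        # Walk the reading frame until the first in-frame stop codon.
        stop_end = None
        for pos in range(start + 3, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                stop_end = pos + 3
                break
        if stop_end is None:
            continue  # no in-frame stop: not a complete ORF
        overlaps = stop_end > cds_start
        if overlaps and (cds_start - start) % 3 == 0:
            continue  # in frame with the CDS: N-terminal extension, not a uORF
        uorfs.append(
            UORF(
                cap_distance=start,
                length_codons=(stop_end - 3 - start) // 3,
                stop_to_cds=None if overlaps else cds_start - stop_end,
                overlaps_cds=overlaps,
            )
        )
    return uorfs
```

The number of uORFs per transcript is then simply `len(annotate_uorfs(mrna, cds_start))`. RNA secondary structure, the remaining property in the list, is not covered here because it cannot be derived by simple string inspection.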



FUNCTIONAL CONSEQUENCES OF 
uORF-MEDIATED TRANSLATIONAL CONTROL 

• CDS repression • CDS induction • Start site selection 

Most uORFs analyzed to date repress translation of the 
subsequent initiation site(s) and inhibit/diminish transla- 
tion of the main protein. Post-uORF initiation at the CDS 
initiation codon may occur after leaky scanning of ribo- 
somes across the uORF initiation codon or by reinitiation 
if the uORF-stop codon precedes the CDS (26). Despite a 
generally repressive function on downstream translation, 
several exceptions have been described, including human 
DDIT3 (15), mouse Atf4 and yeast GCN4 (11), where 
translation of specific uORFs or a certain alignment of 
subsequent uORFs mediate enhanced CDS initiation. 
Furthermore, uORF-directed start site selection can 
result in the production of N-terminally distinct protein 
isoforms that harbor unique biological functions, as 
demonstrated for CEBPA and CEBPB transcription 
factors (14,27,28). 

• Nonsense-mediated decay • mRNA destabilization 

Nonsense-mediated decay (NMD) of mRNA is activated 
when specific cellular surveillance mechanisms detect 
premature termination of protein translation (29). 
Such premature termination events may result from the
use of nonsense codons that arise in mature transcripts
due to mutations, incorrect splicing or aberrant initiation 
site selection. Upstream ORFs have been suggested to 
induce NMD by conferring additional termination 
codons to the 5'-leader sequence of certain transcripts. 
Expression profiling in mammalian cells (30), 
Caenorhabditis elegans (31) and yeast (32) revealed an en- 
richment of uORF-containing transcripts in the fraction 
of mRNAs that were targeted by NMD. Similarly, 
another mode of termination-dependent RNA destabiliza- 
tion that is distinct and independent of the common NMD 
pathway has been reported in yeast (33,34). 

• Ribosome load • Ribosome pausing/stalling • Ribosome 
shunting 

Mutational deletion of a uORF can result in increased 
ribosome load on a given transcript associated with 
increased translational activity, as observed for human 
AMD1 (35) and ERBB2 (36). On the contrary, ribosome 
stalling at the uORF termination codon or pausing of 
ribosomes on inhibitory uORF structures (37) may 
hamper CDS translation. In specific cases, such as the 
Arabidopsis transcription factor GBF6, binding of a 
small molecule cofactor (sucrose) to the nascent uORF- 
peptide induced stalling of ribosomes at the uORF termin- 
ation codon and resulted in decreased translational 
initiation at the CDS (38). Additional examples of 
ribosome stalling or pausing due to the interaction of 
uORF-peptides with regulatory small molecules entail 
the translational repression of mammalian AMD1 by 
polyamines (39,40) or repression of yeast CPA1 and 
Neurospora crassa Arg2 by arginine (16). 

Underlining the multiplicity of uORF-mediated trans- 
lational regulation, certain uORFs may facilitate 
enhanced CDS translation by supporting a ribosome 
shunt across a highly structured and inhibitory transcript 
leader sequence, as best studied for Cauliflower mosaic 
virus 35S RNA (41). 



CO-REGULATORY EVENTS AFFECTING uORF 
FUNCTIONS 

• Kozak consensus sequence 

Whether or not the ternary preinitiation complex recog- 
nizes an AUG or non-AUG triplet as a translational start 
codon is strongly influenced by the nucleotide context sur- 
rounding it. Extensive sequence analysis (20,42) as well as 
mutational analysis (26,43,44) identified crucial nucleotide 
residues in the context of an AUG triplet that create 
favorable or unfavorable surroundings for translational 
initiation. The optimal surrounding sequence for initiation
is GCCRCCAUGG (also called the optimal Kozak consensus
sequence; R represents a purine base; the most
important residues are the R at position −3 and the G at
position +4, counting the A of the AUG codon as +1). Initiation sequence
contexts are frequently classified as strong (both critical
residues match the consensus sequence), adequate/intermediate
(either residue −3 or +4 matches) or weak (neither
critical residue matches) (45). If the AUG codon
is surrounded by a strong context, virtually
all scanning ribosomes will stop and initiate
translation. When the surrounding context is weak, many
ribosomes may scan past the AUG codon and instead 
initiate at one further downstream. Since the quality of 
the Kozak consensus sequence is not the only determinant 
of translation initiation efficiency, the mere evaluation of 
the surrounding nucleotides does not permit the precise 
prediction of initiation. 
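
The three-level classification of initiation contexts described above reduces to checking two nucleotides, which the following sketch makes explicit. The function name is illustrative. As the text notes, such a classification only describes the sequence context and does not by itself predict initiation efficiency.

```python
PURINES = {"A", "G"}


def kozak_strength(seq, aug_index):
    """Classify the context of the AUG at aug_index as strong, adequate or weak.

    With the A of AUG numbered +1, position -3 lies three nucleotides
    upstream of the A, and position +4 directly follows the G of AUG.
    """
    seq = seq.upper().replace("T", "U")
    if seq[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")
    if aug_index < 3 or aug_index + 3 >= len(seq):
        raise ValueError("sequence too short to read positions -3 and +4")
    minus3_ok = seq[aug_index - 3] in PURINES  # consensus: R (A or G)
    plus4_ok = seq[aug_index + 3] == "G"       # consensus: G
    matches = int(minus3_ok) + int(plus4_ok)
    return {2: "strong", 1: "adequate", 0: "weak"}[matches]
```

For the optimal consensus itself, `kozak_strength("GCCACCAUGG", 6)` evaluates both critical positions as matching and returns `"strong"`.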

• Translational status 

Regulation through uORFs may integrate the overall trans- 
lation status of a cell and adjust the translation rate of 
important regulatory proteins. This was first described in 
a series of experiments on the yeast transcription factor 
GCN4, where four subsequent uORFs control the para- 
doxical translation initiation of the main protein while 
global translation is shut down (11,46-48). Briefly, under 
favorable translational conditions with high levels of the 
eIF2-GTP-Met-tRNAiMet ternary complex, a fraction of
the ribosomes that translate the GCN4 uORF1 reinitiate
at the inhibitory uORF4, detach from the mRNA at the
uORF4-stop codon and thus inhibit translation of GCN4.
Under starvation conditions, low availability of the ternary
complex causes delayed restoration of a functional pre-initiation
ribosomal complex after translation of uORF1. This
results in leaky scanning across the uORF4 initiation codon 
and permits translation of the GCN4 CDS only after pro- 
longed post-termination scanning. 

Similar mechanisms depending on the translational 
status of a cell have been described for the mammalian 
transcription factors ATF4 (49), ATF5 (50), CEBPA 
and CEBPB (14), and the macrophage receptor protein 
CD36 (51). 
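
The GCN4 mechanism described above can be illustrated with a deliberately simple toy model. Everything quantitative in it is an assumption made for illustration: the distances, the rate constant, the exponential form of ternary complex reacquisition, and the simplification that a reloaded ribosome always initiates at the next start codon it meets. It is not a model taken from the cited studies; it only shows why lowering ternary complex availability can raise initiation at the CDS.

```python
import math


def reinitiation_probabilities(tc_level, d_uorf4=200, d_cds=400, rate_per_nt=0.02):
    """Toy model of where ribosomes reinitiate after translating uORF1.

    All default values are hypothetical. Distances are nucleotides scanned
    after the uORF1 stop codon. A scanning ribosome is assumed to reacquire
    the ternary complex at a constant rate per nucleotide, proportional to
    tc_level (1.0 = favorable conditions).
    """
    k = rate_per_nt * tc_level
    # Ribosome is already reloaded when it reaches the uORF4 start codon.
    p_uorf4 = 1.0 - math.exp(-k * d_uorf4)
    # Ribosome passes uORF4 unloaded but reloads before the CDS start codon.
    p_cds = math.exp(-k * d_uorf4) * (1.0 - math.exp(-k * (d_cds - d_uorf4)))
    return p_uorf4, p_cds
```

Comparing `reinitiation_probabilities(1.0)` with `reinitiation_probabilities(0.15)` shows the qualitative behavior described in the text: at the lower ternary complex level, fewer ribosomes reinitiate at uORF4 and more reach the CDS in a reloaded state.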

• Termination (context) 

The sequence context surrounding a uORF termination 
codon may determine the reinitiation efficiency at down- 
stream initiation sites. In particular, stable interactions 
between the terminating ribosome and the RNA, or 
stable base pairing of the RNA alone may cause ribosomal 
pausing or mediate premature mRNA decay (34,52). 

• uORF RNA/peptide sequence • Regulatory sequence 
motif • Cofactor/ribosome interaction 

Specific RNA sequences may influence CDS translation 
by forming stable secondary structures, by binding to a 
regulatory cofactor or by direct interaction with the 
translating ribosome. Furthermore, uORF-encoded 
peptides may induce ribosome stalling and inhibit down- 
stream translation on binding of their respective small 
molecule interactors, as demonstrated for the sucrose 
control peptide of Arabidopsis GBF6 (38) or the arginine 
attenuator peptide of Neurospora ARG2 (53). For other 
transcripts, including the HHV-5 gp48 mRNA (54), the 
DNA damage-inducible transcript 3 (DDIT3/CHOP/CEBPζ)
(55) and the vasopressin V1b receptor (56), translational
repression by uORF-encoded peptides has been
described without detailed analysis of the mechanism 
involved. A subset of ~200 human uORFs was suggested 
to encode unique functional peptides based on a high 
degree of amino acid sequence conservation (57). 

Except for the Kozak consensus sequence, only a
few uORF-related co-regulatory RNA sequence motifs
have been identified to date. The most prominent example was
described for Drosophila msl-2, where a protein interaction
RNA motif facilitates binding of the cofactor
protein SXL, which enhances uORF initiation and thereby
represses translation of the CDS (58). In yeast GCN4,
reinitiation-promoting elements have been identified surrounding
uORF1, which interact with eukaryotic initiation
factor 3a to facilitate downstream reinitiation (59).
Recently, the h-subunit of eIF3 was found to promote
reinitiation after translation of a reinitiation-permissive
uORF (60). To what extent 'specialized ribosomes'
interact with uORFs and other cis-regulatory RNA
elements to regulate translation awaits investigation (61).

MEDICAL IMPACT 

• Disease-related uORFs • Acquired mutations • SNPs 

Defects in uORF-mediated translational control may 
result in the development of human disease. Loss of a 
uORF in a mutation-related alternative splicing product 
of the thrombopoietin gene drives enhanced translation of 
thrombopoietin and causes hereditary thrombocytosis 
(62). The roles of uORF-related mutations in CDKN2A 
and CDKN1B for cancer development were mentioned 
above (9,10). Marie Unna hereditary hair loss is caused 
by a variety of mutations altering a uORF within the 
hairless homolog (HR) transcript, resulting in increased 
expression of hairless homolog protein (8). Additional 
uORF-altering mutations were identified by computa- 
tional analysis of the Human Gene Mutation Database 
(7). Diseases with a confirmed implication of uORF mutations
include cystic fibrosis (CFTR) (63), the van der
Woude syndrome (IRF6), hereditary pancreatitis 
(SPINK1), familial hypercholesterolemia (LDLR) and 
some others (7). Furthermore, the expression of the beta 
secretase BACE1, related to Alzheimer's disease (64), or 
the transmembrane receptor tyrosine kinase ERBB2, 
related to breast cancer (65), is at least partially controlled 
by uORFs. Whether deregulated uORF-mediated transla- 
tional control is the crucial pathogenic event in these latter 
cases remains to be established. 

Despite few unequivocal cases at this time, it is evident 
that uORF mutations may be involved in a wide variety of 
diseases, including malignancies, metabolic or neurologic 
disorders and inherited syndromes. Considering that 
many important regulatory proteins, including cell 
surface receptors, tyrosine kinases and transcription 
factors, act in a dose-dependent fashion and possess
uORFs, we speculate that a substantial number of as yet 
unexplained pathologies will be traced back to uORF mu- 
tations altering expression levels of such key regulatory 
genes. 

MANUSCRIPT CATEGORIES 

• Mouse models • Ribosome profiling • Bioinformatics/ 
arrays/screens • Proteomics 

To date, two genetically altered mouse models have 
been generated, confirming the pathogenic role of loss- 
of-uORF mutations in HR resulting in Marie Unna 
hereditary hypotrichosis in humans (66) and validating the
physiological importance of the CEBPB uORF in cellular 
differentiation and proliferation (6), respectively. 

Recent progress in computational and sequencing- 
based technologies and the development of the ribosome 
profiling method (3) have generated a large amount of 
information on uORF localization, initiation codon 
usage and uORF function in response to altered translational
conditions (2). Nevertheless, it is not yet possible to
predict whether a uORF is translated or has a regulatory 
role from sequence information only. 

Proteomic studies have identified a number of poten- 
tially functional uORF-encoded peptides in human cells 
(67,68). In the human K562 cell line, 40% of small ORF- 
encoded peptides detected by mass spectrometry 
originated from transcript leader sequences (69). 

OUTLOOK AND FURTHER DEVELOPMENT 
OF uORFdb 

The uORFdb is intended to grow concomitantly with the
publication of novel uORF-related literature with respect
to the number of references listed and the number of
categorized uORF-related properties. We aim to con- 
stantly improve the quality and completeness of indexing 
applied to individual references and invite users to send 
feedback, additions and corrections via the contact page 
of the uORFdb Web site. 